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Abstract 

Sensor  devices  and  embedded  processors  arc  becoming  ubiquitous,  especially  in  measurement  and 
monitoring  applications.  Automatic  discovery  of  patterns  and  trends  in  the  large  volumes  of  such  data 
is  of  paramount  importance.  The  combination  of  relatively  limited  resources  (CPU,  memory  and/or 
communication  bandwidth  and  power)  poses  some  interesting  challenges.  We  need  both  powerful  and 
concise  “languages”  to  represent  the  important  features  of  the  data,  which  can  (a)  adapt  and  handle 
arbitrary  periodic  components,  including  bursts,  and  (b)  require  little  memory  and  a  single  pass  over 
the  data. 

This  allows  sensors  to  automatically  (a)  discover  interesting  patterns  and  trends  in  the  data,  and  (b) 
perform  outlier  detection  to  alert  users.  We  need  a  way  so  that  a  sensor  can  discover  something  like 
“the  hourly  phone  call  volume  so  far-  follows  a  daily  and  a  weekly  periodicity,  with  bursts  roughly 
every  year-,”  which  a  human  might  recognize  as,  e.g.,  the  Mother’s  day  surge.  When  possible  and  if 
desired,  the  user  can  then  issue  explicit  queries  to  further  investigate  the  reported  patterns. 

In  this  work  we  propose  AWSOM  (Arbitrary  Window  Stream  modeling  Method),  which  allows 
sensors  operating  in  remote  or  hostile  environments  to  discover  patterns  efficiently  and  effec¬ 
tively,  with  practically  no  user  intervention.  Our  algorithms  require  limited  resources  and  can 
thus  be  incorporated  in  individual  sensors,  possibly  alongside  a  distributed  query  processing  en¬ 
gine  [CCC+02,  BGS01,  MSHR02].  Updates  arc  performed  in  constant  time,  using  sub-linear  (in 
fact,  logarithmic)  space.  Existing,  state  of  the  art  forecasting  methods  (AR,  SARIMA,  GARCH,  etc) 
fall  short  on  one  or  more  of  these  requirements.  To  the  best  of  our  knowledge,  AWSOM  is  the  first 
method  that  has  all  the  above  characteristics. 

Experiments  on  real  and  synthetic  datasets  demonstrate  that  AWSOM  discovers  meaningful  patterns 
over  long  time  periods.  Thus,  the  patterns  can  also  be  used  to  make  long-range  forecasts,  which  arc 
notoriously  difficult  to  perform  automatically  and  efficiently.  In  fact,  AWSOM  outperforms  manually 
set  up  auto-regressive  models,  both  in  terms  of  long-term  pattern  detection  and  modeling,  as  well  as 
by  at  least  10  x  in  resource  consumption. 


1  Introduction 


Several  recent  applications  produce  huge  amounts  of  data  in  the  form  of  a  semi-infinite  stream  of 
values  [GKMS01,  GKS01,  DGGR02,  GG02].  Formally,  a  stream  is  a  time  sequence  of  numbers 
Xo,Xi , . . . ,  Xi, . . .  like  samples  or  measurements  at  discrete  time  ticks. 

Time  sequences  have  attracted  a  lot  of  attention  in  the  past  [BJR94],  for  forecasting  in  financial, 
sales,  environmental,  ecological  and  biological  time  series,  to  mention  a  few.  However,  several  new 
and  exciting  applications  have  recently  become  possible. 

The  emergence  of  cheap  and  small  sensors  has  attracted  significant  attention.  Sensors  are  small  de¬ 
vices  that  gather  measurements — for  example,  temperature  readings,  road  traffic  data,  geological  and 
astronomical  observations,  patient  physiological  data,  etc.  There  arc  numerous,  fascinating  applica¬ 
tions  for  such  sensors  and  sensor  networks,  such  as:  (a)  health  care,  with  potentially  wearable  sensors 
monitoring  blood  pressure,  heart  rate,  temperature  etc.  and  detecting  patterns  and  abnormalities,  (b) 
industrial  applications,  keeping  track  of  manufacturing  process  parameters  and  production,  (c)  civil 
infrastructure,  with  sensors  embedded  in  bridges  and  highways  [CGNOO]  monitoring  vibrations  and 
material  deterioration,  (d)  road  traffic  conditions  and  safety,  (e)  smart  houses  and  elderly  care. 

Although  current  small  sensor  prototypes  [HSW+00]  have  limited  resources  (512  bytes  to  128Kb 
of  storage),  dime-sized  devices  with  memory  and  processing  power  equivalent  to  a  modern  PDA  arc 
not  far  away.  In  fact,  PDA-like  devices  with  data  gathering  units  arc  already  being  employed  in  some 
of  the  above  applications  (such  as  highway  and  industrial  monitoring).  The  goal  in  the  next  decade 
is  single-chip  computers  with  powerful  processors  and  2-10Gb  [CBF+00,  SGNGOO]  of  nonvolatile 
storage. 

Furthermore,  embedded  processors  arc  becoming  ubiquitous  and  their  power  has  yet  to  be  har¬ 
nessed.  A  few  examples  of  such  applications  are:  (a)  intelligent  ( active )  disks  [RFGNOO]  that  learn 
common  input  traffic  patterns  and  do  appropriate  prefetching  and  buffering,  (b)  intelligent  routers  that 
can  monitor  data  traffic  and  simplify  network  management. 

From  now  on,  we  use  the  term  “sensor”  broadly,  to  refer  to  any  embedded  computing  device  with 
fairly  limited  processing,  memory  and  (optionally)  communication  resources  and  which  generates  a 
semi-infinite  sequence  of  measurements. 

The  resource  limitations  unavoidably  imply  the  need  for  certain  trade-offs — it  is  impossible  to 
store  everything.  Furthermore,  we  would  like  to  make  the  most  of  available  resources,  allowing  the 
sensor  to  adapt  and  operate  without  supervision  for  as  long  as  possible. 

This  is  the  problem  we  address  in  this  work.  The  goal  is  a  “language”  (i.e.,  model  or  representation) 
for  efficient  and  effective,  automatic  stream  mining.  We  want  to  collect  information,  in  real-time  and 
without  any  human  intervention,  to  answer  questions  such  as  “what  is  the  typical  temperature  pattern 
during  a  year”  and  “what  is  a  good  guess  for  the  next  Christmas'  phone  traffic?”  Furthermore,  we 
would  ideally  want  to  discover  such  patterns  automatically  and  receive  appropriate  alerts. 

This  problem  is  orthogonal  to  that  of  continuous  query  processing.  We  focus  on  an  adaptive 
algorithm  that  can  look  for  arbitrary  patterns  and  requires  no  prior  human  intervention  to  guide  it. 
There  arc  situations  when  we  do  not  know  beforehand  what  we  arc  looking  for.  Furthermore,  it  may 
be  impossible  to  guide  the  sensor  as  it  collects  data,  due  to  the  large  volume  of  data  and/or  limited  or 
unavailable  communication.  If  further  exploration  is  desired,  users  can  issue  targeted  queries  later  on, 
guided  by  the  general  long-term  patterns  to  quickly  narrow  down  the  “search  space.” 

In  detail,  the  main  requirements  are: 

•  No  human  in  the  loop:  The  method  should  not  require  human  intervention  before  or  during 
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data  gathering.  In  a  general  sensor  setting  we  cannot  afford  human  intervention. 

•  Periodic  component  identification:  Humans  can  achieve  this  task,  visually,  from  the  time- 
plot.  Our  method  should  automatically  spot  multiple  periodic  components,  each  of  unknown, 
arbitrary  period. 

•  Online,  one-pass  algorithm:  We  can  afford  neither  the  memory  or  time  for  offline  updates, 
much  less  multiple  passes  over  the  data  stream. 

•  Limited  memory:  Sensor  memory  will  soon  be  exhausted,  unless  our  method  carefully  detects 
redundancies  (or  equivalently,  patterns)  and  exploits  them  to  save  space.  Furthermore,  we  ideally 
want  our  models  to  collect  data  even  when  network  connectivity  is  intermittent  (e.g.,  due  to 
power  constraints)  or  even  non-existent. 

•  Simple,  but  powerful  patterns:  We  need  simple  patterns  (i.e.,  equations,  rules)  which  can  be 
easily  communicated  to  other  nearby  sensors  or  to  a  central  processing  site.  These  patterns 
should  be  powerful  enough  to  capture  most  of  the  regularities  in  real-world  signals. 

•  Any-time  forecasting/outlier  detection:  It  is  not  enough  to  do  compression  (e.g.,  of  long  si¬ 
lence  periods,  or  by  ignoring  small  Fourier  or  wavelet  coefficients).  The  model  should  be  gen¬ 
erative,  in  order  to  spot,  store  and  report  outliers:  an  outlier  can  be  defined  as  any  value  that 
deviates  too  much  from  our  forecast  (e.g.,  by  two  standard  deviations).  It  should  be  able  to  do 
so  immediately,  in  real  time. 

Our  AWSOM  model  has  all  of  these  characteristics,  while  none  of  the  previously  published  methods 
(AR  and  derivatives,  Fourier  analysis,  wavelet  decomposition — see  Section  2. 1)  can  claim  the  same. 

The  rest  of  the  paper  is  organized  as  follows:  Section  2  discusses  the  related  work.  Section  3  briefly 
presents  relevant  background  (some  further  information  is  contained  in  the  appendices).  Section  4 
describes  the  proposed  method  and  its  algorithms.  Section  5  presents  experimental  results  on  real  and 
synthetic  datasets.  Section  6  gives  the  conclusions. 

2  Related  work 

An  interesting  method  for  discovering  representative  trends  in  time  series  using  so-called  sketches 
was  proposed  by  Indyk  et.  al.  [IKMOO].  A  representative  trend  is  a  section  of  the  time  series  itself 
that  has  the  smallest  sum  of  “distances”  from  all  other  sections  of  the  same  length.  The  proposed 
method  employs  random  projections  [JL84]  for  dimensionality  reduction  and  FFT  to  quickly  compute 
the  sum  of  distances.  However,  it  cannot  be  applied  to  semi-infinite  streams,  since  each  section  has  to 
be  compared  to  all  others. 

Recent  work  by  Gilbert  et.  al.  [GKMS01]  uses  wavelets  to  compress  the  data  into  a  fixed  amount 
of  memory,  by  keeping  track  of  the  largest  Haar  wavelet  coefficients  and  updating  them  on-line  (in 
the  following,  we  will  use  the  name  Incremental  DWT  or  IncDWT  for  short).  The  work  in  [GGI+02a] 
presents  this  in  the  context  of  piecewise-constant  histogram  maintenance  (see  also  [GKS02]).  Also, 
[GGI  02b]  presents  a  novel  method  for  approximate  estimation  of  the  largest  Fourier  coefficients  by 
sampling  only  a  portion  of  the  series  and  proves  bounds  based  on  the  uncertainty  principle.  However, 
all  the  above  methods  do  not  try  to  discover  patterns  and  trends  in  the  data.  Thus,  they  cannot  compete 
directly  with  our  proposed  method,  which  employs  a  generative  model  for  pattern  and  trend  detection. 
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More  recently,  Garofalakis  et.  al.  [GG02]  presented  an  approach  for  accurate  data  compression 
using  probabilistic  wavelet  synopses.  However,  this  method  has  an  entirely  different  focus  and  cannot 
be  applied  to  semi-infinite  streams. 

Further  work  on  streams  focuses  on  providing  exact  answers  to  pre-specified  sets  of  queries  using  a 
minimum  amount  of  memory.  Arvind  et.  al.  [ABB+02]  study  the  memory  requirements  of  continuous 
queries  over  relational  data  streams.  Datar  et.  al.  [DGI+02]  keep  exact  summary  statistics  and  provide 
theoretical  bounds  in  the  setting  of  a  bitstream. 

There  is  also  some  recent  work  on  approximate  answers  to  various  types  of  continuous  queries. 
Gehrke  et.  al.  [GKS01]  presents  a  comprehensive  approach  for  answering  correlated  aggregate  queries 
(e.g.,  “find  points  below  the  (current)  average”),  using  histogram  summaries  to  approximate  aggre¬ 
gates.  Dobra  et.  al.  [DGGR02]  present  a  method  for  approximate  answers  to  aggregate  multi-join 
queries  over  several  streams,  by  using  random  projections  and  boosting. 

Finally,  a  comprehensive  system  for  linear  regression  on  multi-dimensional  time  series  data  was 
presented  very  recently  in  [CDH+02].  Although  this  framework  employs  varying  resolutions  in  time, 
it  does  so  by  straight  aggregation,  using  pre-specified  aggregation  levels  (although  the  authors  discuss 
the  use  of  a  geometric  progression  of  time  frames)  and  can  only  deal  with  linear  trends,  using  straight 
lineal-  regression  (as  opposed  to  auto-regression). 

2.1  Previous  approaches 

None  of  the  continuous  querying  methods  deal  with  pattern  discovery  and  forecasting.  The  typical 
method  for  forecasting  (i.e.,  generative  time  series  modeling)  uses  the  traditional  auto-regressive  (AR) 
models  or  their  generalizations,  auto-regressive  moving  average  (ARMA),  auto-regressive  integrated 
moving  average  (ARIMA )  and  seasonal  ARIMA  ( SARIMA ) — see  Appendix  A.  Although  popular,  these 
methods  fail  to  meet  many  of  the  requirements  listed  in  the  introduction.  The  most  important  fail¬ 
ure  is  that  they  need  human  intervention  and  fine-tuning.  As  mentioned  in  statistics  textbooks  such 
as  [BD91]: 

“The  first  step  in  the  analysis  of  any  time  series  is  to  plot  the  data.  [...]  Inspection  of 
a  graph  may  also  suggest  the  possibility  of  representing  the  data  as  a  realization  of  [ the 
‘classical  decomposition’  model].” 

Thus,  such  methods  are  not  suited  for  remote,  unsupervised  operation.  Furthermore,  these  methods 
have  a  number  of  other  limitations: 

•  Existing  methods  for  fitting  models  are  typically  batch-based,  that  is,  they  do  not  allow  for 
recursive  update  of  model  parameters. 

While  recursive  least  squares  is  effective  in  solving  this  problem  for  AR  models,  it  does  not 
generalize  to  handle  the  more  general  class  of  ARMA  models. 

•  Established  methods  for  determining  model  structure  (i.e.,  period  of  seasonal  components  and 
order  of  the  ARIMA  model)  are  at  best  computationally  intensive,  besides  not  easy  to  automate. 

•  If  there  are  non-sinusoidal  periodic  components,  ARIMA  models  will  miss  them  completely, 
unless  specifically  instructed  to  use  a  large  window  (for  instance.  Mother’s  day  traffic  will  not 
be  captured  with  a  window  less  than  365  days  long). 
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Large  window  sizes  introduce  further  estimation  problems,  both  in  terms  of  resource  require¬ 
ments  as  well  as  accuracy.  SARIMA  models  provide  a  workaround,  but  can  deal  only  with  one 
or,  at  best,  a  few  pre- determined,  constant  periods. 

•  In  addition,  ARIMA  models  do  not  do  a  good  job  of  handling  “bursty”  time  series.  ARIMA’s 
lineal-  difference  equations  eventually  lead  to  to  sinusoidal  signals  of  constant  or  exponentially 
decreasing  amplitude  (or  mixtures  thereof).  Thus,  it  can  not  generate  signals  with  strong  bursts, 
like,  e.g.,  disk  traffic  [WMC+02]  or  LAN  traffic  [RMSCB99],  even  when  these  bursts  are  re¬ 
occurring  (in  fact,  such  bursts  may  even  lead  to  exponential  divergence  in  a  generated  sequence). 

While  GARCH  models  introduced  by  [B0I86]  provide  a  means  of  modeling  the  class  of  bursty 
white  noise  sequences,  and  a  combined  SARIMA-GARCH  model  can  indeed  model  bursty  time 
series  with  seasonal  components,  the  computational  difficulties  involved  with  such  models  are 
far  greater  than  those  for  the  SARIMA  models. 

Recently,  the  ARIMA  model  has  been  extended  to  ARFIMA  (auto-regressive  fractionally  inte¬ 
grated  moving  average),  which  handles  the  class  of  self-similar  bursty  sequences  (such  as  Frac¬ 
tional  Gaussian  Noise  [Ber94]).  However,  ARFIMA  and  its  generalizations  are  even  harder  than 
ARIMA  to  use  and  require  human  expert  intervention  to  determine  yet  another  coefficient,  the 
Hurst  exponent  [CB96], 

All  the  above  methods  deal  with  linear  forecasting.  Non-linear  modeling,  using  chaos  and  fractals, 
has  been  attracting  increasing  interest  [WG94],  However,  these  methods  also  require  the  intervention 
of  a  human  to  choose  the  appropriate  windows  for  (non-linear)  regression  or  to  configure  an  artificial 
neural  network. 

Data  streams  are  essentially  signals.  There  is  a  huge  body  of  work  in  the  signal  processing  liter¬ 
ature  related  to  compression  and  feature  extraction.  Typical  tools  include  the  Fast  Fourier  Transform 
(FFT)  [OS75],  as  well  as  the  Discrete  Wavelet  Transform  (DWT)  [PWOO].  However,  most  of  the  algo¬ 
rithms  (a)  deal  with  fixed  length  signals  of  size  N,  and  (b)  cannot  do  forecasting  (i.e.,  do  not  employ  a 
generative  model). 

Thus  the  authors  believe  that  there  is  a  need  (see  also  Table  2)  for  straightforward  methods  of  time 
series  model  building  which  can  be  applied  in  real-time  to  semi-infinite  streams  of  data,  using  limited 
memory. 


3  Background  material 

In  this  section  we  give  a  very  brief  introduction  to  the  most  relevant  background  material  (the  appen¬ 
dices  contain  some  further  information). 

3.1  Auto-regressive  modeling 

Auto-regressive  models  (also  known  as  the  Box-Jenkins  method  [BJR94])  are  the  most  widely  used.  A 
more  complete  enumeration  of  AR  models  can  be  found  in  Appendix  A.  The  main  idea  is  to  express 
Xt  as  a  function  of  its  previous  values,  plus  (filtered)  noise  ep. 

Xt  =  f  iXt-i  +  . . .  +  fwXt-w  +  et  (1) 
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Symbol 

Definition 

xt 

Stream  value  at  time  tick  t  =  0,1,.... 

N 

Number  of  points  so  far  from  { A/}. 

wt,t 

Wavelet  coefficient  (level  l  and  time  t). 

Vi,t 

Scaling  coefficient  (level  l  and  time  t ). 

Psi.St 

AWSOM  coefficient,  (51,  St)  G  T>). 

V 

Window  offsets  for  AWSOM  (window  “size”  is  \V\). 

AWSOMA(no, . . . ,  n\) 

Number  of  level  offsets  (A)  and  offsets  per  level  (no, . . . ,  n\) 
in  V — see  Definition  2.  (no, . . .  ,n\)  is  also  called  the 
AWSOM  order. 

na,  a,  t 

Parameters  that  determine  the  number  of  equations  in  the  full 
model,  AWSOMa,t(^0;  •  •  • ,  n\) — see  Definition  3. 

Table  1:  Symbols  and  definitions. 


where  W  is  a  window  that  is  determined  by  trial  and  error,  or  by  using  a  criterion  that  penalizes  model 
complexity  (i.e.,  large  values  of  W),  like  the  Akaike  Information  Criterion  (AIC). 

AR(I)MA  requires  manual  preprocessing  by  trained  statisticians  to  remove  trends  and  seasonal¬ 
ities,  typically  by  visual  inspection  of  the  sequence  itself,  as  well  as  its  Auto-Correlation  Function 
(ACF). 

3.2  Recursive  least  squares 

Recursive  Least  Squares  ( RLS )  is  a  method  that  allows  dynamic  update  of  a  least-squares  tit.  Updates 
can  be  done  in  0(k2)  time  and  space,  where  k  is  the  number  of  regression  variables.  More  details  arc 
given  in  Appendix  C  and  in  [You84]. 

3.3  Wavelets 

The  At -point  discrete  wavelet  transform  (DWT)  of  a  length  N  time  sequence  gives  N  wavelet  coef¬ 
ficients.  Each  coefficient  is  responsible  for  a  frequency  range  within  a  time  window  (the  higher  the 
frequency,  the  smaller  the  time  window).  Figure  1  shows  the  scalogram,  that  is,  the  “map”  of  the 
magnitude  of  each  wavelet  coefficient  versus  the  location  in  time  and  frequency  it  is  “responsible”  for. 

The  DWT  of  a  sequence  can  be  computed  in  O(N)  time.  Furthermore,  as  new  points  arrive,  it 
can  be  updated  in  0(1)  amortized  time.  This  is  made  possible  by  the  structure  of  the  decomposition 
into  time  and  frequency  (explained  shortly)  which  is  unique  to  wavelets.  For  instance,  the  Fourier 
transform  also  decomposes  a  signal  into  frequencies  (i.e.,  sum  of  sines),  but  requires  O(NlgN)  time 
to  compute  and  cannot  be  updated  as  new  points  arrive.  A  brief  introduction  to  the  wavelet  transform 
can  be  found  in  Appendix  B. 

For  our  purposes  here,  we  shall  restrict  ourselves  to  wavelets  of  the  Daubechies  family,  which 
arc  easy  to  compute  (even  in  the  case  of  infinite  streams),  have  desirable  smoothness  properties  and 
successfully  compress  many  real  signals.  In  practice,  although  by  far  the  most  commonly  used  (largely 
due  to  their  simplicity),  Haar  wavelets  arc  too  unsmooth  and  introduce  significant  artifacting  [PWOO]. 
In  fact,  unless  otherwise  specified,  we  use  Daubechies-6. 
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time 


Bases  Scalogram 

Figure  1:  Haar  bases  and  correspondence  to  time/frequency  (for  signal  length  N  =  16).  Each  wavelet 
coefficient  is  a  linear  projection  of  the  signal  to  the  respective  basis. 


Incremental  wavelets  This  paid  is  a  very  brief  overview  of  how  to  compute  the  DWT  incrementally. 
This  is  the  main  idea  of  IncDWT  [GKMS01],  which  uses  Haar  wavelets.  In  general,  when  using  a 
wavelet  filter  of  length  L,  the  wavelet  coefficient  at  a  particular  level  is  computed  using  the  L  cor¬ 
responding  scaling  coefficients  of  the  previous  level.  Recall  that  L  =  2  for  Haar,  and  L  =  6  for 
Daubechies-6  that  we  typically  use.  Thus,  we  need  to  remember  the  last  L  —  1  scaling  coefficients  at 
each  level.  We  call  these  the  wavelet  crest. 

Definition  1  (Wavelet  crest).  The  wavelet  crest  at  time  t  is  defined  as  the  set  of  scaling  coefficients 
( wavelet  smooths)  that  need  to  be  kept  in  order  to  compute  the  new  wavelet  coefficients  when  X/ 
arrives. 

Lemma  1  (DWT  update).  Updating  the  wavelet  crest  requires  space  (L— l)lg  N +L  =  O(LlgN)  = 
0(lg  N),  where  L  is  the  width  of  the  wavelet  filter  (fixed)  and  N  the  number  of  values  seen  so  far. 

Proof.  See  [GKMS01].  Generalizing  to  non-Haar  wavelets  and  taking  into  account  the  wavelet  filter 
width  is  straightforward.  □ 

3.3.1  Wavelet  properties 

In  this  section  we  focus  on  some  of  the  properties  of  the  DWT  which  arc  relevant  to  AWSOM. 

Time/frequency  decomposition  Notice  (see  scalogram  in  Figure  1)  that  higher  level  coefficients  arc 
highly  localized  in  time,  but  involve  uncertainty  in  frequency  and  vice-versa.  This  is  a  fundamental 
trade-off  of  any  time/frequency  representation  and  is  a  manifestation  of  the  uncertainty  principle, 
according  to  which  localization  in  frequencies  is  inversely  proportional  to  localization  in  time.  This 
is  a  fundamental  principle  that  implies  certain  trade-offs  in  any  method!  When  dealing  with  semi¬ 
infinite  streams  in  limited  memory,  we  unavoidably  need  to  make  certain  trade-offs.  Given  this,  the 
wavelet  representation  is  an  excellent  choice:  it  “compresses”  well  many  real  signals,  while  it  is  fast 
to  compute  and  can  be  updated  online. 
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Method 

Contin. 

Streams 

Trends  / 

Forecast 

Auto¬ 

matic 

Memory 

DFT  (IV-point) 

NO 

NO 

— 

— 

SWFT  (iV-point) 

YES(?) 

NO 

— 

— 

DWT  (IV-point) 

NO 

NO 

— 

— 

IncDWT  [GKMS01] 

YES 

NO 

— 

— 

Sketches  [IKMOO] 

NO 

YES(?) 

— 

— 

AR  /  ARIMA 

YES 

YES 

NO  [BJR94] 

W'2 

AWSOM 

YES 

YES 

YES 

m\D\2 

Table  2:  Comparison  of  methods. 


Wavelets  and  decorrelation  A  wavelet  transform  using  a  wavelet  filter  of  length  2 L  can  decorrelate 
only  certain  signals,  assuming  their  L-th  order  (or  less)  backward  difference  is  a  stationary  random 
process  [PWOO].  For  real  signals,  this  value  of  L  is  not  known  in  advance  and  may  be  impractically 
large:  the  space  complexity  of  computing  new  wavelet  coefficients  is  O(LlgN) — see  Lemma  1. 

Wavelet  variance  One  further  benefit  of  using  wavelets  is  that  they  decompose  the  variance  across 
scales.  Furthermore,  the  plot  of  log-power  versus  scale  can  be  used  to  detect  self-similar  components 
(see  Appendix  B.l  for  a  brief  overview). 

4  Proposed  method 

In  this  section  we  introduce  our  proposed  model. 

4.1  Intuition  behind  our  method 

What  equations  should  we  be  looking  for  to  replace  ARIMA’s  (see  Equation  1)?  In  particular,  how  can 
we  capture  periodic  components  of  arbitrary  period? 

First  part — information  representation  As  explained  in  section  3.3.1,  given  our  limited  resources, 
we  have  to  make  a  choice  about  how  to  efficiently  and  effectively  capture  the  important  information 
in  the  sequence.  Traditional  models  (such  as  ARIMA)  operate  directly  in  the  time  domain.  Thus,  they 
cannot  deal  with  redundancies,  seasonalities,  long-range  behavior,  etc.  This  is  where  a  human  expert  is 
needed  to  manually  detect  these  phenomena  and  transform  the  series  to  match  ARIMA’s  assumptions. 

This  is  a  crucial  choice — is  there  a  better  one?  We  want  a  powerful  and  flexible  representation 
that  can  adapt  to  the  sequence,  rather  than  expect  someone  to  adapt  the  sequence  to  the  represen¬ 
tation.  Wavelets  arc  extremely  successful  in  compressing  most  real  signals,  such  as  voice  and  im¬ 
ages  [Fal96,  Fie93,  PTVF92],  seismic  data  [ZdZ98],  biomedical  signals  [Aka97]  and  economic  time 
sequences  [GSW01]. 

By  using  wavelet  coefficients,  we  immediately  discard  many  redundancies  (i.e.,  near-zero  valued 
wavelet  coefficients)  and  focus  on  the  things  that  really  matter.  Furthermore,  the  DWT  can  be  com¬ 
puted  quickly  and  updated  online. 
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Figure  2:  AWSOM — Intuition  and  demonstration.  AWSOM  captures  intra-  and  inter-scale  correla¬ 
tions.  Also,  the  left  figure  demonstrates  why  we  fit  different  models  per  level. 

Second  part — correlation  structure  In  the  wavelet  domain,  how  can  we  capture  arbitrary  periodic¬ 
ities?  The  answers  come  from  the  properties  of  the  DWT.  A  periodic  signal  (even  a  spike  train — e.g.. 
Mother’s  day  traffic),  will  have  high  absolute  values  for  the  wavelet  coefficients  at  the  scales  that  corre¬ 
spond  to  its  frequency.  Even  more,  successive  coefficients  on  the  same  level  should  have  related  values 
(see  Figure  2,  left).  Thus,  in  order  to  capture  periodic  components,  we  should  look  for  correlations 
between  wavelet  coefficients  within  the  same  level. 

Furthermore,  how  should  we  capture  bursts?  Short  bursts  carry  energy  in  most  frequencies.  There¬ 
fore  wavelet  coefficients  across  different  scales  (i.e.,  frequencies)  will  have  large  values  (see  Figure  2, 
middle).  If  the  phenomenon  follows  some  pattern,  then  it  is  likely  that  there  will  be  an  inter-scale 
correlation  among  several  of  the  wavelet  coefficients. 

Third  part — correlation  modeling  The  last  question  we  need  to  answer  is:  what  type  of  regression 
models  should  we  use  to  quantify  these  correlations?  Our  proposed  method  tries  to  capture  inter-  and 
intra-scale  correlations  by  fitting  a  linear  regression  model  in  the  wavelet  domain.  These  can  also  be 
updated  online  with  RFS. 

To  summarize,  we  have  argued  for  using  the  wavelet  representation  of  the  series  and  capturing 
correlations  within  and  across  scales  (see  Figure  2,  right).  We  have  “substituted,”  in  effect,  the  hu¬ 
man  expert  with  the  DWT  and  this  correlation  structure.  Thus,  we  expect  that  linear  regression  will 
successfully  capture  these  correlations  with  very  few  coefficients. 
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UpdateCrest  (A[f]): 

Update  (X[t]): 

Foreach  l  >  0  s.t.  2l  divides  t: 

UpdateCrest(  X  [t  ] ) 

Compute  V[l,t/2l] 

Foreach  new  coefficient  W[l,  t'}  in  the  crest: 

If  2l+l  divides  t: 

Find  the  linear  model  it  belongs  to 

Compute  W[l,t/2l+l] 

based  on  l  and  t!  mod  A 

Delete  W[l,t/2l+1  -  L] 

Update  XT X  and  XTy  for  this  model 

Figure  3:  High-level  description  of  update  algorithms. 

4.2  AWSOM  modeling 

Formally,  our  proposed  method  tries  to  tit  models  of  the  following  form: 

Wijt=  A),iWz,t-i  +  Po,2Wi:t-2  +  ■  ■  ■ 

/?l,oW/_l,t/2+  Pl,lWl-l,t/2-l  +  ■  •  • 

/?2,0  AF^_2,i/4+  •  •  • 


or  more  concisely  (where  e;  *  is  the  usual  error  term) 

^  Psi,St^l+5l,t/2sl-St  +  el,t  (2) 

(8l,8t)eV 

where  D  is  a  set  of  index  offsets.  The  (3si,St  are  called  the  AWSOM  coefficients. 

Definition  2  (AWSOM  order).  The  set  of  offsets  is  always  of  the  form 

V  =  {  (0,1),  (0,2),  ...,  (0,n0), 

(1,0),  (1,1),  (1,2),  ...,  (l,m  -  1), 

•  ■  •  ) 

(A,0),  ...,  (A, n\  —  1)  } 

i.e.,  each  wavelet  coefficient  is  expressed  as  a  function  of  the  previous  n o  wavelet  coefficients  on  the 
same  level,  n\  coefficients  from  one  level  below  and  so  on.  For  a  particular  choice  of  V,  we  use 
AWSOMA(no, . . . ,  ri\)  or  simply 


AWSOM(no,  ni-,  ■  ■  ■ ,  n\) 

to  denote  this  instance  of  our  model.  We  call  (no, . . . ,  n\)  the  order  of  the  AWSOM  model.  The  total 
order  is  the  number  of  AWSOM  coefficients  k  per  equation,  i.e.,  k  =  Y^5i= o  nA/- 

A  fixed  choice  of  V  is  sufficient  for  all  signals.  In  most  of  our  experiments  we  use  AWSOM(6, 4,  2) 

(k  =  12). 

The  naive  approach  would  be  to  fit  Equation  2  to  all  data  points  (i.e.,  wavelet  coefficients).  How¬ 
ever,  an  approach  that  gives  more  accurate  results  is  to  fit  one  equation  per  level  (see  Figure  2),  as  long 
as  the  level  contains  enough  wavelet  coefficients  to  get  a  good  fit.  Thus,  in  actual  use  on  a  running 
stream,  we  would  fit  one  equation  for  every  level  l  <  A,  where  A  is  the  level  that  has  no  more  than. 
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Figure  4:  Illustration  of  AWSOM2,2(l,  1)  with  N\  =  2.  The  shade  of  each  wavelet  coefficient 
corresponds  to  the  model  equation  used  to  “predict”  it.  The  unshaded  wavelet  coefficients  correspond 
to  initial  conditions  (i.e.,  with  incomplete  AWSOM  window  V). 

say,  N\  =  16  wavelet  coefficients.  For  levels  l  >  A  we  can  either  keep  the  exact  wavelet  coefficients 
(which  would  be  no  more  than  16  +  8  +  --  -  +  l  =  31inthe  above  case)  and/or  fit  one  more  equation. 
Thus,  as  more  data  points  arrive,  the  value  of  A  gradually  increases  as  necessary. 

Besides  fitting  one  equation  per  level  (up  to  A),  when  we  use  AWSOMa  with  A  >  1,  we  also 
fit  different  equations  depending  on  time  location  t.  For  instance,  if  we  arc  using  AWSOM1  (no,  2), 
we  should  fit  one  equation  for  pairs  Wi^t  and  Wj_ 1;t  and  another  for  pairs  Wi^t+i  and  ITS  - 1 ./  (see 
Figure  4).  In  general,  we  need  2A  separate  models  to  ensure  that  the  inter-scale  correlations  A  levels 
down  arc  not  “shoehorned”  into  the  same  regression  model. 

To  summarize,  the  full  AWSOM  model  fits  a  number  of  equations: 

Wijt  =  ^2  filSi*StWi+sitt-stei,t  (3) 

{5l,St)eT> 

where  1/  =  min(Z,  A)  and  t!  =  t  mod  T.  For  example,  if  T  =  2  and  A  =  1,  we  estimate  one  linear 
equation  for  each  set  of  wavelet  coefficients  Wq^i,  H/o,2i+i>  TT),2i  and  TT),2fc+i  ((  >  1,  i  >  0).  The 
significant  advantage  of  this  approach  is  that  we  can  still  easily  update  the  AWSOM  equations  online, 
as  new  data  values  arrive.  This  is  possible  because  the  equation  is  selected  based  only  on  l  and  t  for 
the  new  wavelet  coefficient  (see  also  Figure  3). 

Definition  3.  For  the  full  model  described  above,  we  use  the  notation 

AWSOMa, t(™0, n±, . .  .,n\) 

The  number  of  equations  rri  is  simply  rri  =  AT. 

The  value  of  A  depends  on  N  and  the  value  of  T  is  fixed  and  depends  only  on  A.  In  general,  we 
automatically  pick  the  following  values: 

r  N 

T  ~  2a  and  A  ~  lg  — —  =  lg  /V  -  lg  NA  -  A 
A  a  1 
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ModelSelection: 

For  each  linear  model: 

Estimate  SSR  of  complete  model 
For  each  subset  of  regression  variables: 
Compute  SSR  of  reduced  model 
from  3 

Estimate  probability  that  reduction 
in  variance  is  not  due  to  chance 
Select  the  subset  of  variables  with 
highest  probability  (or  keep  all 
if  not  within  95%  confidence  interval) 


Figure  5:  High-level  description  of  basic  selection  algorithm. 

4.3  Model  selection 

When  fitting  a  linear  model,  as  we  increase  the  number  of  variables  (or,  in  the  context  of  forecasting,  as 
we  increase  the  window  size),  we  expect  in  general  to  get  a  better  fit.  However,  what  we  would  really 
like  to  find  are  those  variables  that  have  a  statistically  significant  contribution  to  the  output  value  (or, 
forecast).  The  reasons  for  this  arc  twofold: 

•  Over-fitting  the  data  may  result  in  a  good  approximation  of  the  past,  but  a  number  of  the  correla¬ 
tions  may  in  reality  be  due  to  noise  and  not  carry  over  well  in  forecasts  of  the  future.  Therefore, 
by  picking  only  the  important  variables,  we  improve  accuracy. 

•  More  importantly,  in  the  pattern-mining  context,  we  want  to  filter  out  the  effects  of  noise  and 
present  only  those  patterns  that  arc  important  to  the  user. 

More  details  on  model  selection  arc  given  in  Appendix  D  (see  also  Figure  5). 

Model  selection  and  combination  needs  to  be  done  when  interpreting  the  models;  processing  on  the 
sensor  is  possible,  but  not  necessary.  In  fact,  all  operations  can  be  performed  using  only  data  gathered 
online  and  time  complexity  is  independent  of  the  stream  size.  The  only  thing  that  needs  to  be  decided 
in  advance  is  the  largest  AWSOM(no, . . . ,  ri\ )  order  we  may  want  to  tit.  From  the  data  collected,  we 
can  then  automatically  select  any  model  of  smaller  order  (AWSOM(ng, . . . ,  n'x,),  where  X'  <  X  and 
n'  <  rii). 

4.4  Complexity 

In  this  section  we  show  that  our  proposed  AWSOM  models  can  be  easily  estimated  with  a  single¬ 
pass,  “any-time”  algorithm.  From  Lemma  1,  estimating  the  new  wavelet  coefficients  requires  space 
0(lg./V).  In  fact,  since  we  typically  use  Daubechies-6  wavelets  (L  =  6),  we  need  to  keep  exactly 
5  lg  N  +  6  values.  The  AWSOM  models  can  be  dynamically  updated  using  RES. 

Lemma  2  (Logarithmic  space  complexity).  Maintaining  the  model  requires  0( lg  N  +  mk2 )  space, 
where  N  is  the  length  of  the  signal  so  far,  k  is  the  number  of  AWSOM  coefficients  in  each  equation 
and  m .  the  number  of  equations. 
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Dataset 

Size 

Description 

Triangle 

64K 

Triangle  wave  (period  256) 

Mix 

256K 

Square  wave  (period  256) 
plus  sine  (period  64) 

Impulses 

64K 

Impulse  train  (every  256  points) 

ARFIMA 

2K 

Fractionally  differenced  ARIMA 
(R  package  fracdiff). 

Sunspot 

2K 

Sunspot  data 

Disk 

2K 

Disk  access  trace  (from 
Hewlett-Packard) 

Automobile 

32K 

Automobile  traffic  sensor  trace 
from  large  Midwestern  state 

Table  3:  Description  of  datasets  (sizes  are  in  number  of  points,  1K=1024  points). 

Proof.  Keeping  the  wavelet  crest  scaling  coefficients  requires  space  0(lg  N).  If  we  use  recursive  least 
squares,  we  need  to  maintain  a  k  x  k  matrix  for  each  of  the  m  equations  in  the  model.  □ 

Auto-regressive  models  with  a  comparable  window  size  need  space  0(m2k2),  since  the  equivalent 
fair  window  size  is  IT  ~  rrik.  Here,  “fair”  means  that  the  number  of  total  number  of  AWSOM 
coefficients  plus  the  number  of  initial  conditions  we  need  to  store  is  the  same  for  both  methods.  This  is 
the  information  that  comprises  the  data  synopsis  and  that  would  have  to  be  eventually  communicated. 
However,  the  device  gathering  the  measurements  needs  extra  storage  space  in  order  to  update  the 
models.  The  latter  is,  in  fact,  much  larger  for  AR  than  for  AWSOM  (see  Figure  6).  Thus  this  definition 
of  equivalent  window  actually  favors  AR. 

Theorem  1  (Time  complexity).  Updating  the  model  when  a  new  data  point  arrives  requires  0(k2) 
tune  on  average,  where  k  is  the  number  of  AWSOM  coefficients  in  each  equation. 

Proof.  On  average,  the  wavelet  crest  scaling  coefficients  can  be  updated  in  0(1)  amortized  time.  Al¬ 
though  a  single  step  may  require  0(lg  N)  time  in  the  worst  case,  on  average,  the  (amortized)  time 
required  is  OQ0"=(J  Bii)/N)  =  0(1)  (where  B{i)  is  the  number  of  trailing  zeros  in  the  binary  repre¬ 
sentation  of  i) 1 . 

Updating  the  k  x  k  matrix  for  the  appropriate  linear  equation  (which  can  be  identified  in  0(1), 
based  on  level  l  and  on  t  rnodT),  requires  time  0(k2).  □ 

Once  again,  auto-regressive  models  with  a  comparable  window  size  need  time  0(m2k2)  for  each 
update. 

Corollary  1  (Constant-time  update).  When  the  model  parameters  have  been  fixed  ( typically  k  is  a 
small  constant  ~  10  and  m  ~  lg  N ),  the  model  requires  space  0(lg  N)  and  amortized  time  0(1)  for 
each  update. 


'Seen  differently,  IncDWT  is  essentially  an  pre-order  traversal  of  the  wavelet  coefficient  tree. 
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Triangle  Mix  Imptrain  Sunspot  Disk  Automobile 


Figure  6:  Memory  space  requirements:  Space  needed  to  keep  the  models  up-to-date  (AWSOM  and 
AR  with  equivalent,  fair  window  size. 

5  Experimental  evaluation 

We  compared  AWSOM  against  standard  AR  (with  the  equivalent,  fair  window  size — see  Section  4.4), 
as  well  as  hand-tuned  (S)ARIMA  (wherever  possible).  Our  prototype  AWSOM  implementation  is 
written  in  Python,  using  Numeric  Python  for  fast  array  manipulation.  We  used  the  standard  t  s  package 
from  R  (version  1.6.0 — see  http:  //www.  r-pro  ject .  org/)  for  AR  and  (S)ARIMA  models.  We 
illustrate  the  properties  of  AWSOM  and  how  to  interpret  the  models  using  synthetic  datasets  and  then 
show  how  these  apply  to  real  datasets  (see  Table  3). 

Only  the  first  half  of  each  sequence  was  used  to  estimate  the  models,  which  were  then  applied  to 
generate  a  sequence  of  length  equal  to  that  of  the  entire  second  half.  For  AR  and  (S)ARIMA,  the  last 
values  (as  dictated  by  the  window  size)  of  the  first  half  were  used  to  initiate  generation.  For  AWSOM 
we  again  used  as  many  of  the  last  wavelet  coefficients  from  each  DWT  level  of  the  first  half  as  were 
necessary  to  start  applying  the  model  equations.  We  should  note  that  generating  more  than,  say,  10 
steps  ahead  is  very  rare:  most  methods  in  the  literature  [WG94]  generate  one  step  ahead,  then  obtain 
the  correct  value  of  Xt+\,  and  only  then  try  to  generate  Xt+2.  Nevertheless,  our  goal  is  to  capture 
long-term  behavior  and  AWSOM  achieves  this  efficiently,  unlike  ARIMA. 

5.1  Interpreting  the  models 

Visual  inspection  A  “forecast”  is  essentially  a  by-product  of  any  generative  time  series  model:  ap¬ 
plication  of  any  model  to  generate  a  number  of  “future”  values  reveals  precisely  the  trends  and  patterns 
captured  by  that  model.  In  other  words,  synthesizing  points  based  on  the  model  is  the  simplest  way 
for  any  user  to  get  a  quick,  yet  fairly  accurate  idea  of  what  the  trends  arc  or,  more  precisely,  what 
the  model  thinks  they  arc.  Thus,  what  we  expect  to  see  (especially  in  a  long-range  forecast)  is  the 
important  patterns  that  can  be  identified  from  the  real  data. 

However,  an  expert  user  can  extract  even  more  precise  information  from  the  models.  We  will  now 
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Figure  7:  Disk  and  ARFIMA  datasets. 


explain  how  the  “AWSOM  language”  can  be  fully  interpreted. 

Variance  test  As  explained  in  Appendix  B.l,  if  the  signal  is  self-similar,  then  the  plot  of  log-power 
versus  scale  is  linear. 

Definition  4  (Variance  diagnostic).  We  call  the  log-power  vs.  scale  plot  the  wavelet  variance  diag¬ 
nostic  plot  ( or  just  variance  diagnostic ).  In  particular. ;  we  use  the  correlation  coefficient  pa  to  quantify 
the  relation.  If  the  plot  is  linear  (in  a  range  of  scales),  the  slope  a  is  the  self-similarity  exponent 
(— 1  <  a  <  0,  closer  to  zero  the  more  bursty  the  series). 

A  large  value  of  \pa\,  at  least  across  several  scales,  indicates  that  it  the  series  component  in  those 
scales  may  be  modeled  using  a  fractional  noise  process  with  parameter  dictated  by  a  (see  Automo¬ 
bile  dataset).  However,  we  should  otherwise  be  careful  in  drawing  further  conclusions  about  the 
behavior  within  these  scales. 

We  should  note  that  after  the  observation  by  [LTWW94],  fractional  noise  processes  and,  in  gen¬ 
eral,  self-similar  sequences  have  revolutionized  network  traffic  modeling.  Furthermore,  self-similar  se¬ 
quences  appeal-  in  atomic  clock  fluctuations,  river  minima,  compressed  video  bit-rates  [Ber94,  PWOO], 
to  mention  a  few  examples. 

Wavelet  variance  (energy  and  power)  The  magnitude  of  variance  within  each  scale  serves  as  an 
indicator  about  which  frequency  components  are  the  dominant  ones  in  the  sequence.  To  precisely 
interpret  the  results,  we  also  need  to  take  into  account  the  fundamental  uncertainty  in  frequencies  (see 
Figure  15).  However,  the  wavelet  variance  plot  quickly  gives  us  the  general  picture  of  important  trends. 
Furthermore,  it  guides  us  to  focus  on  AWSOM  coefficients  around  frequencies  with  large  variance. 

AWSOM  coefficients  Regardless  of  the  energy  within  a  scale,  the  AWSOM  coefficients  provide 
further  information  about  the  presence  of  trends  in  the  signal,  which  cannot  be  deduced  from  the 
variance  plots.  In  particular: 
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Figure  8:  Wavelet  variance  diagnostic. 


•  Large  intra-scale  coefficients:  These  capture  patterns  at  certain  frequencies,  regardless  of  the 
contribution  of  these  frequencies  to  overall  energy.  Furthermore,  if  the  coefficients  arc  not  the 
same  for  all  regression  models  at  the  same  level,  this  is  an  indication  of  “seasonalities”  within 
that  scale  and  capture  a  different  type  of  information  about  larger  frequencies. 

•  Large  inter-scale  coefficients:  These  occur  when  there  arc  repeated  bursts  in  the  series.  The 
number  of  scales  with  large  inter-scale  coefficients  depends  on  the  duration  of  the  bursts  (short 
bursts  have  large  bandwidth). 


To  summarize,  the  steps  arc: 

•  Examine  the  variance  diagnostic  to  identify  sub-bands  that  correspond  to  a  self-similar  compo¬ 
nent.  These  may  be  modeled  using  a  fractional  noise  process,  but  otherwise  we  cannot  say  much 
more. 

•  Examine  the  wavelet  and  energy  and  power  spectrum  to  quickly  identify  important  sub-bands. 

•  Examine  AWSOM  coefficients,  primarily  within  and  around  the  sub-bands  identified  during  the 
second  step. 

5.2  Synthetic  datasets 

We  present  synthetic  datasets  to  illustrate  the  basic  properties  of  AWSOM,  its  behavior  on  several 
characteristic  classes  of  sequences,  and  the  principles  behind  interpreting  the  models.  Applying  the 
models  to  generate  a  number  of  “future”  data  points  is  the  quickest  way  to  see  if  each  method  captures 
long-term  patterns. 

ARFIMA.  This  is  a  synthetic  dataset  of  a  fractional  noise  process,  which  illustrates  the  first  point 
about  AWSOM  model  interpretation.  As  seen  in  Figure  8,  this  exhibits  a  clearly  linear  relationship  in 
the  wavelet  variance  plot.  The  estimated  fractional  differencing  (“burstiness”)  parameter  5  =  —a/2  ~ 
0.24  is  close  to  the  actual  parameter  used  (6  =  0.3).  It  is  clear  that  no  periodic  component  is  present 
in  this  series. 

Triangle.  AR  fails  to  capture  anything,  because  the  window  is  not  large  enough.  SAR  esti¬ 
mation  (with  no  differencing,  no  MA  component  and  only  a  manually  pre-specified  lag-256  seasonal 
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Figure  9:  Forecasts — synthetic  datasets.  Note  that  AR  gives  the  wrong  trend  (if  any),  while  seasonal 
AR  fails  to  complete. 
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Figure  10:  Forecasts — real  datasets.  Note  that  AR  fails  to  detect  any  trend,  while  seasonal  AR  either 
fails  to  complete  or  gives  a  wrong  conclusion  (decaying  trend,  fixed  period)  in  260  x  time. 


component)  fails  completely.  In  fact,  R  segfaults  after  several  minutes,  even  without  using  maximum- 
likelihood  estimation  (MLE).  However,  AWSOM  captures  the  periodicity.  The  AWSOM  model  visu¬ 
alization  is  similar  to  that  for  Mix. 

Mix.  Once  again,  AR  is  confused  and  does  not  capture  even  the  sinusoidal  component.  SAR 
estimation  (without  MLE)  fails  (R’s  optimizer  returns  an  error,  after  several  minutes  of  computation). 
Figure  14  shows  the  AWSOM  coefficients.  We  show  only  the  levels  that  correspond  to  significant 
variance.  These  illustrate  the  first  point  in  the  interpretation  of  AWSOM  coefficients.  We  clearly  see 
strong  correlations  in  levels  6  and  8  (which  correspond  to  the  periods  26  =  64  and  28  =  256  of  the 
series  components).  Note  that  the  variance  alone  (see  also  Figure  15)  is  not  enough  to  convey  this 
information. 

Impulses.  Once  again,  AR  fails  to  capture  anything  and  SAR  estimation  fails.  AR  fails  because 
the  window  is  too  small.  However,  AWSOM  captures  the  overall  behavior.  Figure  14  illustrates  the 
second  point  in  the  interpretation  of  AWSOM  coefficients.  We  clearly  see  repeated  presence  of  bursts, 
with  strong  inter-scale  correlations  across  all  levels  up  to  the  impulse  “period”  (since  the  bursts  have 
width  one).  We  show  those  levels  that  correspond  to  the  bursts.  At  level  5,  information  from  the 
impulse  “period”  begins  to  enter  in  the  wavelet  coefficients  (see  also  Figure  15).  After  level  7,  the 
inter-scale  correlations  diminish  in  significance  and  the  interpretation  is  similar  to  that  for  Mix. 

We  should  mention  that  we  tried  SAR(0)x(l)128  on  an  impulse  train  of  period  128.  On  a  sequence 
with  1024  points,  R  takes  over  4  minutes  (on  a  signal  with  64K  points  it  did  not  complete  in  over  one 
hour).  However,  AWSOM  estimates  the  parameters  (with  64K  points)  in  approximately  50  seconds, 
although  our  prototype  is  implemented  in  Python. 
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5.3  Real  datasets 


For  the  real  datasets,  we  show  the  so-called  marginal  distribution  quantile-quantile  plots  (or  Q-Q 
plots — see  Figure  12  and  Figure  11).  These  are  the  scatter  plots  of  (x,  y )  such  that  p%  of  the  values 
are  below  x  in  the  real  sequence  and  below  y  in  the  generated  sequence.  When  the  distributions  are 
identical,  the  Q-Q  plot  coincides  with  the  bisector  of  the  first  quadrant. 

Sunspot.  This  dataset  is  well-known  and  it  has  a  time-varying  period.  AR  again  fails  completely. 
SAR  (without  a  MA  component,  much  less  MLE)  takes  40  minutes  to  estimate.  AWSOM  (in  Python) 
takes  less  than  9  seconds.  SAR  gives  a  completely  fixed  period,  captures  a  false  downward  trend  and 
misses  the  marginal  distribution  (see  Figure  12).  On  the  other  hand,  AWSOM  captures  the  general 
periodic  trend,  with  a  desirable  slight  confusion  about  the  period  (since  the  period  is  varying  and  thus 
un-predictable). 

Automobile.  This  dataset  has  a  strongly  linear  variance  diagnostic  in  scales  1-6  (Figure  8). 
Flowever,  the  lower  frequencies  contain  the  most  energy,  as  can  be  seen  in  the  variance  plot  (Figure  13. 
This  is  an  indication  that  we  should  focus  at  these  scales.  The  lowest  frequency  corresponds  to  a  daily 
periodicity  (approximately  4000  points  per  day,  or  about  8  periods  in  the  entire  series)  and  next  highest 
frequency  corresponds  to  the  morning  and  afternoon  rush-hour. 

In  this  series,  low  frequencies  can  be  modeled  by  fractional  noise.  Figure  1 1  shows  a  generated 
sequence  with  fractional  noise,  as  identified  by  AWSOM.  The  fractional  difference  parameter  is 
estimated  as  6  =  —a/2  ~  0.276  and  the  amplitude  is  chosen  to  match  the  total  variance  in  those 
scales. 

However,  for  unsupervised  outlier  detection,  this  is  not  necessary:  what  would  really  constitute 
an  outlier  would  be,  for  instance,  days  that  (a)  do  not  follow  the  daily  and  rush-hour  patterns,  or  (b) 
whose  variance  in  the  fractional  noise  scales  is  very  different.  This  can  be  captured  automatically  by 
the  series  components  in  the  appropriate  frequency  sub-bands  that  AWSOM  identities  as  a  periodic 
component  and  self-similar  noise,  respectively. 

Disk.  This  is  a  very  difficult  data  set  to  characterize  (even  by  humans).  It  exhibits  a  linear 
variance  diagnostic  (see  Figure  8)  across  all  scales.  Therefore,  the  regression  models  are  of  little  use 
in  this  case.  However,  from  the  variance  plot,  we  see  moderate  spikes  at  frequencies  that  correspond 
to  daily  periodicities  (48  points  per  day)  and,  to  a  lesser  extent,  weekly  periodicities.  However,  the 
presence  in  those  frequencies  is  fairly  week  and  the  points  (2K)  are  too  few  to  safely  draw  any  further 

Automobile  -  AWSOM  (3)  Automobile  (Q-Q) 


Figure  11:  Automobile — forecast  with  fractional  noise. 
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Figure  12:  Marginal  Q-Q  plots  (slope  and  correlation  coefficients  in  parentheses). 


conclusions;  AWSOM  provides  the  necessary  information  to  judge  what  conclusions  can  be  drawn. 
Both  AR  and  SAR  (with  a  hand-picked  lag  of  48  to  capture  the  daily  periods)  fail  completely  and  do 
not  even  provide  a  hint  about  the  weak  (compared  to  the  bursts)  periodic  components. 
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Figure  13:  Wavelet  variances. 
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6  Conclusions 


Sensor  networks  arc  becoming  increasingly  popular;  thanks  to  falling  prices  and  increasing  storage  and 
processing  power.  In  this  work  we  presented  AWSOM  (Arbitrary-Window  Stream  modeling  Method). 
Our  method  is  the  only  one  that  achieves  all  our  goals:  (1)  Unsupervised  operation:  once  we  decide 
the  largest  AWSOM  order  we  want,  no  further  intervention  is  needed:  the  sensor  can  be  left  alone 
to  collect  information.  (2)  ‘Any -time’,  one-pass  algorithm  to  incrementally  update  the  patterns.  (3) 
Automatic  detection  of  periodic  components  with  arbitrary  period.  (4)  Limited  memory:  our  method 
requires  0(lg  N)  memory  (where  N  is  the  length  of  the  sequence  so  far-).  (5)  Simplicity:  AWSOM 
provides  linear-  models  which  have  a  straightforward  interpretation.  (6)  Power:  AWSOM  provides 
information  across  several  frequencies  and  can  diagnose  self-similarity  and  long-range  dependence. 
(7)  Immediate  outlier  detection:  our  method,  despite  its  simplicity  and  its  automatic,  unsupervised 
operation,  is  nevertheless  able  to  do  forecasting.  It  can  do  so  directly,  for  the  estimated  model.  We 
showed  real  and  synthetic  data,  where  our  method  captures  the  periodicities  and  burstiness  of  the  input 
sequence,  while  the  traditional  ARIMA  method  fails  completely. 

AWSOM  is  an  important  first  step  toward  successful,  hands-off  data  mining  in  infinite  streams, 
combining  simplicity  with  modeling  power.  Continuous  queries  are  useful  for  evidence  gathering 
and  hypothesis  testing  once  we  know  what  we  are  looking  for.  AWSOM  is  the  first  method  to  deal 
directly  with  the  problem  of  automatic,  unsupervised  stream  mining  and  pattern  detection  and  fill  the 
gap.  Among  the  many  future  research  directions,  the  most  promising  seems  to  be  the  extension  of 
AWSOM  to  multiple,  co-evolving  time  sequences,  beyond  MUSCLES  [YSJ+00]  and  multi- variate 
ARIMA. 
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A  Auto-regressive  modeling 

The  simplest  form,  an  auto-regressive  model  of  order  p  or  AR(p),  expresses  Xt  as  a  linear  combination 
of  previous  values,  i.e.,  Xt  =  (j)iXt-i  +  •  •  •  +  fpXt-p  +  e*  or,  more  concisely 

f{L)Xt  =  et 


23 


where  L  is  the  lag  operator  and  <J>(L)  is  a  polynomial  defined  on  this  operator: 


LXt  = 

q KL )  =  1-01 L-  </>2L2 - 0pLp 


and  6f  is  a  white  noise  process,  i.e., 


E[et]  =  0  and  Cov[et,et_fc] 


fj2  if  k  =  0 
0  otherwise 


Using  least-squares,  we  can  estimate  c2  from  the  sum  of  squared  residuals  (SSR).  This  is  used  as  a 
measure  of  estimation  error;  when  generating  “future”  points,  e*  is  set  to  its  expected  value,  E[ej]  =  0. 

The  next  step  up  are  auto-regressive  moving  average  models.  An  ARMA(p,  q)  model  expresses 
values  Xt  as 

<KL)Xt  =  9(L)et 

where  9(L)  =  1  —  9\L  —  ■  ■  ■  —  9qLq.  Estimating  the  moving  average  coefficients  9t  is  fairly  involved. 
State  of  the  art  methods  use  maximum-likelihood  (ML)  algorithms,  employing  iterative  methods  for 
non-linear  optimization,  whose  computational  complexity  depends  exponentially  on  q. 

ARIMA(p,  d.  q)  models  arc  similar  to  ARMA(p,  q)  models,  but  operate  on  (1  —  L)dXt,  i.e.,  the 
d-th  order  backward  difference  of  Xt: 

(/>{L){l-L)dXt  =  9(L)et 

Finally,  SARIMA(p,  d,  q)  x  (P,  D,  Q)T  models  are  used  to  deal  with  seasonalities,  where: 
0(L)$(Lt)(  1  -  L)d(  1  -  LT)DXt  =  9{L)Q{LT)et 


where  the  seasonal  difference  polynomials 

$(Lr)  =  1  -  Lt  -  <5>2L2T - $PLPT 

0(Lr)  =  1  -  0iLt  -  Q2L2T - @qLqt 

arc  similar  to  4>(L)  and  9(L)  but  operate  on  lags  that  arc  multiples  of  a  fixed  period  T.  The  value  of  T 
is  determined  purely  by  trial  and  error,  or  by  utilizing  prior  knowledge  about  the  series  Xt. 


B  Discrete  Wavelet  Transform 

Wavelets  arc  best  introduced  with  the  Haar  transform,  because  of  its  simplicity.  At  each  level  l  of  the 

construction  we  keep  track  of  two  sets  of  coefficients,  each  of  which  “looks”  at  a  time  window  of  size 
2i. 


•  Wif  The  smooth  component,  which  consists  of  the  N/21  scaling  coefficients.  These  capture  the 
low-frequency  component  of  the  signal;  in  particular,  the  frequency  range  [0, 1/2*]. 

•  Vi/.  The  detail  component,  which  consists  of  the  N/21  wavelet  coefficients.  These  capture  the 
high-frequency  component;  in  particular,  the  frequency  range  [1/2*,  1/2*-1]. 
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Cascade  gain  (D6) 


Figure  15:  Daubechies-6  cascade  gain  (levels  3-5). 


The  construction  starts  with  Vq  /  =  Xt  and  Wo,t  is  not  defined.  At  each  iteration  l  =  1, 2, . . . ,  lg  N 
we  perform  two  operations  on  V/_  ]  t  to  compute  the  coefficients  at  the  next  level: 

•  Differencing,  to  extract  the  high  frequencies: 

Wi,t  =  (Vj-i,2t  —  H-i^-t)/^ 

•  Smoothing,  by  averaging2  each  consecutive  pair  of  values  to  extract  the  low  frequencies: 

Vitt  =  {Vi-ipt  +  Vi-i,2t-i)  /  V2 

We  stop  when  Wn  consists  of  one  coefficient  (which  happens  at  l  =  lg  N+ 1).  The  scaling  coefficients 
arc  needed  only  during  the  intermediate  stages  of  the  computation.  The  final  wavelet  transform  is  the 
set  of  all  wavelet  coefficients  along  with  Vjgjv+i,o-  Starting  with  MgJV+i,o  (which  is  also  referred  to  as 
the  signal’s  scaling  coefficient)  and  following  the  inverse  steps,  we  can  reconstruct  each  V)  t  until  we 
reach  V0,t  =  Xt. 

Figure  1  illustrates  the  final  effect  for  a  signal  with  N  =  16  values.  Each  wavelet  coefficient  is  the 
result  of  projecting  the  original  signal  onto  the  corresponding  basis  signal. 

In  general,  there  arc  many  wavelet  transforms,  but  they  all  follow  the  pattern  above:  a  wavelet 
transform  uses  a  pair  of  filters,  one  high-pass  and  one  low-pass.  In  the  case  of  the  Haar  transform 
Different  wavelet  families  (e.g.,  Coiflets,  least-asymmetric,  to  mention  a  few)  achieve  different  trade¬ 
offs  with  respect  to  (un)smoothness  of  the  projections,  phase  shift  properties,  etc  [PWOO]. 

“The  scaling  factor  of  l/y/2  in  both  the  difference  and  averaging  operations  is  present  in  order  to  preserve  total  signal 
energy  (i.e.,  sum  of  squares  of  all  values). 
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Frequency  properties  Wavelet  filters  employed  in  practice  can  only  approximate  an  ideal  bandpass 
filter,  since  they  arc  of  finite  length  L.  This  is  an  unavoidable  consequence  of  the  uncertainty  principle. 
The  practical  implications  arc  that  wavelet  coefficients  at  level  l  correspond  roughly  to  frequencies 
[l/2,+l .  1  /2l]  (or,  equivalently,  periods  [2/.  2/+1]  (see  Table  4  and  Figure  15  for  the  actual  correspon¬ 
dence).  This  has  to  be  taken  into  account  for  precise  interpretation  of  AWSOM  models. 


B.l  Wavelet  variance  and  self-similarity 


The  wavelet  variance  decomposes  the  variance  of  a  sequence  across  scales.  Here  we  mention  the 
definitions  and  basic  facts;  details  can  be  found  in  [PWOO]. 


Definition  5  (Wavelet  variance).  If  {Wuf  is  the  DWT  of  a  series  {X, }  then  the  wavelet  variance  Vi 
is  defined  as 

Vi  =  Var  [Wltt\ 

Under  certain  general  conditions 

9 1  N/2' 

*  =  w‘* 
t= t 

is  an  unbiased  estimator  of  V;.  Note  that  the  sum  is  precisely  the  energy  of  {Xt}  at  scale  l. 

Definition  6  (Self-similar  sequence).  A  sequence  { A'/  }  is  said  to  be  self-similar  following  a  pure 
power-law  process  if 

sx{f)  oc  i/r 

where  —  1  <  a  <  0  and  Sx(f)  is  the  SDF 3 


It  can  be  shown  that 


thus  if  { Xj }  is  self-similar,  then 


/•t/21 

Vi  «  2  /  Sx(f)df 
J 1/21+1 


log  Vi  oc  l 


i.e.,  the  plot  of  log  Vi  versus  the  level  l  should  be  linear.  In  fact,  slope  of  the  log-power  versus  scale 
plot  should  be  approximately  equal  to  the  exponent  a.  This  fact  and  how  to  estimate  Vi  are  what  the 
reader  needs  to  keep  in  mind. 


’The  spectral  density  function  (SDF)  is  the  Fourier  transform  of  the  auto-covariance  sequence  (ACVS)  Sx,k  = 
Cov[Xt,  Xt-k}-  Intuitively,  it  decomposes  the  variance  into  frequencies. 


Periods 

Ideal 

Non-zero 

Dominant 

Peak 

4 

16-32 

11-15 

14-28 

23 

5 

32-64 

23-109 

29-57 

45 

6 

64-128 

41-205 

58-111 

91 

7 

128-256 

157-140 

111-212 

181 

Table  4:  Frequency  information  content  of  Daubechies-6  wavelet  coefficients  per  level. 
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C  Recursive  Least  Squares  (RLS) 


Let  us  assume  that  X  is  an  m  x  k  matrix  of  rri  measurements  (one  set  of  k  input  variables  per  row),  b 
is  the  fc  x  1  vector  of  regression  coefficients  and  y  the  m  x  1  vector  of  outputs.  The  LS  solution  to  the 
overdetermined  system  Xb  =  y  is  the  solution  of 

XTXb  =  XTy  (4) 

When  a  new  vector  xm+i  and  output  ym+ 1  arrive,  we  can  update  the  k  x  k  projection  matrix  XT X 
by  adding  the  outer  product  xm+ix^  ,  1  to  it.  Similarly,  we  can  update  X1  y  by  adding  ym+i^-m+i- 
Since  all  we  need  for  the  solution  are 

P  =  XTX  and  q  =  XTy 

we  need  only  space  0(k2  +  k)  =  0(k2)  to  keep  the  model  up  to  date.  In  fact,  it  is  possible  to  update 
the  regression  coefficient  vector  b  at  each  step  without  explicitly  solving  Equation  4  (see  [You84]). 

D  Model  selection 

We  show  how  feature  selection  with  model  combination  can  be  done  from  the  data  gathered  online 
(i.e.,  P  and  q  for  each  AWSOM  equation). 

D.l  Model  testing 

Lemma  3  (Square  sum  of  residuals).  If  b  is  the  least-squares  solution  to  the  overdetermined  equation 
Xh  =  y,  then 

n 

sn  =  X>iTb  -  Vi)2  =  hTPb  -  2bTq  +  y2 

i= 1 

Proof.  Straightforward  from  the  definition  of  sn,  which  in  matrix  form  is  sn  =  (AT)  —  y )2.  □ 

Thus,  besides  P  and  q,  we  only  need  to  update  y2  (a  single  number),  by  adding  y2  to  it  as  each 
new  value  arrives. 

Now,  if  we  select  a  subset  1  =  {q,  *2, . . . ,  ip}  C  {1, 2, . . . ,  k}  of  the  k  variables  xi,X2,  •  •  • ,  Xfc, 
then  the  solution  bj  for  this  subset  is  given  by  Pjhj  =  qj  and  the  SSR  by  sn  =  hj  Fjhj  — 
2bjqx  +  y2  where  the  subscript  X  denotes  straight  row/column  selection  (e.g.,  Pj  =  [p-ir>h}inihei) 
The  F-test  (Fisher  test)  is  a  standard  method  in  statistics  to  determine  whether  a  reduction  in 
variance  is  statistically  significant.  In  particular,  if  /  is  the  ratio  of  the  sample  variance  of  the  simplified 
model  (i.e.,  one  with  fewer  variables)  to  the  sample  variance  of  the  complete  model,  then  the  F- 
distrihution  describes  the  probability  that  /  takes  a  certain  value,  assuming  that  the  difference  is  due 
to  chance. 

The  F-test  is  based  on  the  sample  variances,  which  can  be  computed  directly  from  the  SSR,  as 
explained  in  Lemma  3.  The  F-distribution  holds  precisely  (i.e.,  non-asymptotically)  under  normality 
assumptions.  However,  in  practice  it  works  well  in  most  circumstances,  especially  when  the  population 
size  is  large.  This  is  clearly  the  case  with  semi-infinite  streams. 
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D.2  Model  combination 


If  we  split  measurements  Xi  into  two  subsets  X\  and  X2  with  corresponding  outputs  yi  and  y2,  then 
the  LS  solution  for  both  subsets  combined  is  given  by  b  =  (XT X)~1XTy  where  X  =  [Xf  Xj]T 
andy  =  [yfy^]T,  i.e., 

b  =  (XfX1  +  XjX2)-\Xfyi  +  Xjy2)  =  (. p1  +  P2)~1(q1  +  q2) 

Therefore,  it  is  possible  to  combine  sub-models  when  reducing  the  number  of  levels  (effectively  re¬ 
ducing  the  T  parameter).  Model  selection  as  presented  above  can  be  extended  to  include  this  case. 
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